!pip install opencv-python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from glob import glob
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
##Import any other packages you may need here
import cv2
import pydicom
import scipy.stats
EDA is open-ended, and it is up to you to decide how to look at different ways to slice and dice your data. A good starting point is to look at the requirements for the FDA documentation in the final part of this project to guide (some) of the analyses you do.
This EDA should also help to inform you of how pneumonia looks in the wild. E.g. what other types of diseases it's commonly found with, how often it is found, what ages it affects, etc.
Note that this NIH dataset was not specifically acquired for pneumonia. So, while this is a representation of 'pneumonia in the wild,' the prevalence of pneumonia may be different if you were to take only chest x-rays that were acquired in an ER setting with suspicion of pneumonia.
Perform the following EDA:
Note: use the full NIH data to perform the first few EDA items and use sample_labels.csv for the pixel-level assessments.
Also, describe your findings and how you will set up the model training based on the findings.
Q.1 What type of images were used in this dataset?
Ans - Grayscale
Q.2 What body parts were examined in this dataset?
Ans - Chest
Q.3 What was the age range of patients who tested positive for pneumonia?
Ans - Between 0 and 100 years
Q.4 What types of view positions were present in this dataset?
Ans - AP and PA
Q.5 What was the total count of patients with pneumonia in this dataset?
Ans - 1431 images are labeled with pneumonia. The count is per image, so a patient with several scans is counted more than once.
Q.6 Which diseases are the most common comorbidities of pneumonia?
Ans - The top 5 most common label combinations that include pneumonia in this dataset are:
1. Infiltration with Pneumonia
2. Edema, Infiltration with Pneumonia
3. Atelectasis with Pneumonia
4. Edema with Pneumonia
5. Effusion with Pneumonia
Based on per-label frequency, these are the diseases that most commonly co-occur with pneumonia:
1. Infiltration
2. Edema
3. Effusion
4. Atelectasis
Q.7 Were you able to spot any differences in pixel intensities between patients who have pneumonia and patients who are healthy?
Ans - The pixel intensities of 3 different healthy patients (No Finding) are listed below, computed over pixels with intensity above 50. The first value is the image mean and the second value is the image standard deviation.
1. 131.74216710869442, 51.71516257188659
2. 169.66516806341969, 51.25696493442237
3. 151.1333099500918, 53.89837944789393
The pixel intensities of 3 different patients who tested positive for pneumonia, in the same format (image mean, then image standard deviation):
1. 123.59639087519794, 40.048603370246795
2. 129.08860275952935, 50.911206559341856
3. 151.73789234004076, 54.98203429309973
Conclusion: The means and standard deviations of the healthy images and the pneumonia images overlap, so these samples show no clear difference between a healthy patient and a patient who tested positive for pneumonia.
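To make the comparison above concrete, here is a small sketch that averages the per-image means in each group and sets the gap between the groups against the spread inside each group. The variable names are illustrative only, and with three images per group this is a sanity check, not a statistical test.

```python
import numpy as np

# (mean, std) pairs copied from the answer to Q.7
healthy = np.array([
    [131.74216710869442, 51.71516257188659],
    [169.66516806341969, 51.25696493442237],
    [151.1333099500918, 53.89837944789393],
])
pneumonia = np.array([
    [123.59639087519794, 40.048603370246795],
    [129.08860275952935, 50.911206559341856],
    [151.73789234004076, 54.98203429309973],
])

# Average of the per-image means in each group
healthy_avg = healthy[:, 0].mean()
pneumonia_avg = pneumonia[:, 0].mean()
gap = healthy_avg - pneumonia_avg

# Spread (max minus min) of the per-image means inside each group
healthy_range = healthy[:, 0].max() - healthy[:, 0].min()
pneumonia_range = pneumonia[:, 0].max() - pneumonia[:, 0].min()

print(f"average image mean, healthy:   {healthy_avg:.2f}")
print(f"average image mean, pneumonia: {pneumonia_avg:.2f}")
print(f"gap between groups:            {gap:.2f}")
print(f"range within healthy:          {healthy_range:.2f}")
print(f"range within pneumonia:        {pneumonia_range:.2f}")
```

The gap between the group averages (about 16 intensity levels) is smaller than the range within either group, which is consistent with the conclusion that these samples do not separate the two classes.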
Q.8 What was the difference between the pixel intensities of pneumonia and all the other diseases, and what conclusion can be drawn from examining them?
Note: The results referred to in this explanation are the lists of means and standard deviations printed in the last block.
Ans - As the results show, the presence of other diseases in a chest X-ray may lower the accuracy of our model, because the means and standard deviations of the diseases are very close to each other. Hernia, Nodule and Atelectasis have a slightly higher mean in some of the images. Notably, the means for the Infiltration and Edema labels are almost the same as the mean of the images that contain Pneumonia, which may lower the accuracy of the model if these images are present in the dataset.
Conclusion: While the presence of other diseases in a chest X-ray can lower the accuracy of the model, the most difficult ones to differentiate are Infiltration and Edema, because their image means are nearly the same as those of images with pneumonia.
## Below is some helper code to read data for you.
## Load NIH data
all_xray_df = pd.read_csv('/data/Data_Entry_2017.csv')
all_xray_df = all_xray_df.dropna(axis=1, how='all')  # drop columns that are entirely empty
all_xray_df.dropna(axis=1, inplace=True)
all_image_paths = {os.path.basename(x): x for x in glob(os.path.join('/data','images*/images', '', '*.png'))}
print('Scans found:', len(all_image_paths), ', Total Headers', all_xray_df.shape[0])
all_xray_df['Image Path'] = all_xray_df['Image Index'].map(all_image_paths.get)
all_xray_df.sample(3)
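The mapping step above leaves a missing value in 'Image Path' for any row whose file was not found by the glob. A minimal sketch of how that could be checked, using a toy table and made-up file names in place of all_xray_df and all_image_paths:

```python
import pandas as pd

# Toy stand-ins for all_xray_df and all_image_paths (file names are made up)
toy_df = pd.DataFrame({"Image Index": ["a.png", "b.png", "c.png"]})
toy_paths = {
    "a.png": "/data/images_001/images/a.png",
    "c.png": "/data/images_002/images/c.png",
}

# Same mapping pattern as above: unknown file names map to a missing value
toy_df["Image Path"] = toy_df["Image Index"].map(toy_paths.get)

# Rows whose image file was not found
missing = toy_df[toy_df["Image Path"].isna()]
print("Rows without an image file:", len(missing))
print(missing["Image Index"].tolist())
```

Running the same `isna()` check on the real table before plotting would show whether every header row has a matching scan.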
## EDA
# Demographic data for Patient Gender
plt.figure(figsize=(6,6))
all_xray_df['Patient Gender'].value_counts().plot(kind='bar')
# Demographic data for Patient Age
plt.figure(figsize=(6,6))
plt.xlim(0, 150)
plt.hist(all_xray_df['Patient Age'])
# Demographic data for Patient Position
plt.figure(figsize=(6,6))
all_xray_df['View Position'].value_counts().plot(kind='bar')
# Sample X-ray images from the dataset
j = 1
plt.figure(figsize=(16,16))
for i in all_xray_df.sample(3)["Image Path"]:
    plt.subplot(1,3,j)
    j=j+1
    plt.imshow(cv2.imread(i))
# Pneumonia count (one row per image)
pneumonia_count = len(all_xray_df[all_xray_df['Finding Labels'].str.contains("Pneumonia")])
non_pneumonia_count = len(all_xray_df['Finding Labels']) - pneumonia_count
print("The total number of Pneumonia cases in the dataset is: " + str(pneumonia_count))
print("The total number of non-Pneumonia cases in the dataset is: " + str(non_pneumonia_count))
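The count above is per image. Since prevalence is one of the things this EDA is meant to describe, the sketch below shows how prevalence could be computed both per image and per unique patient. It runs on a toy table with made-up rows and assumes a 'Patient ID' column is available alongside 'Finding Labels'.

```python
import pandas as pd

# Toy table with made-up rows; 'Patient ID' is assumed to exist in the real data
toy = pd.DataFrame({
    "Patient ID": [1, 1, 2, 3, 3, 4],
    "Finding Labels": [
        "Pneumonia", "Pneumonia|Edema", "No Finding",
        "Infiltration", "No Finding", "Pneumonia",
    ],
})

has_pneumonia = toy["Finding Labels"].str.contains("Pneumonia")

# Per-image view: every scan counts once
image_count = int(has_pneumonia.sum())
image_prevalence = has_pneumonia.mean()

# Per-patient view: a patient counts once if any of their scans shows pneumonia
patient_count = toy.loc[has_pneumonia, "Patient ID"].nunique()
patient_prevalence = patient_count / toy["Patient ID"].nunique()

print("images with pneumonia:", image_count, "prevalence:", image_prevalence)
print("patients with pneumonia:", patient_count, "prevalence:", patient_prevalence)
```

The two views can differ on the real data because patients with follow-up scans contribute several rows.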
# Finding Unique Label Names
from itertools import chain
all_labels = np.unique(list(chain(*all_xray_df['Finding Labels'].map(lambda x: x.split('|')).tolist())))
all_labels = [x for x in all_labels if len(x)>0]
print('All Labels ({}): {}'.format(len(all_labels), all_labels))
for c_label in all_labels:
    if len(c_label)>1: # leave out empty labels
        all_xray_df[c_label] = all_xray_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
all_xray_df.sample(3)
all_xray_df[all_labels].sum()/len(all_xray_df)
plt.figure(figsize=(16,6))
ax = all_xray_df[all_labels].sum().plot(kind='bar')
ax.set(ylabel = 'Number of Images with Label')
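As an alternative to the loop used above, pandas can build the same kind of indicator columns in one call with `Series.str.get_dummies`. This is a sketch on a toy column, not a replacement for the cell above; the rows are made up.

```python
import pandas as pd

# Toy 'Finding Labels' column with made-up rows
toy = pd.DataFrame({"Finding Labels": ["Pneumonia|Edema", "No Finding", "Edema"]})

# Split on '|' and create one 0/1 column per distinct label
one_hot = toy["Finding Labels"].str.get_dummies(sep="|")
print(one_hot)

# Fraction of images carrying each label
print(one_hot.mean())
```

One difference in design: `get_dummies` matches whole labels between separators, whereas the `c_label in finding` test is a substring match, which only behaves the same as long as no label name is contained in another.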
## Co occurring Diseases
plt.figure(figsize=(16,6))
all_xray_df[all_xray_df.Pneumonia==1]['Finding Labels'].value_counts().plot(kind='bar')
##Since there are many combinations of potential findings, We will be looking at the 30 most common co-occurrences:
plt.figure(figsize=(16,6))
all_xray_df[all_xray_df.Pneumonia==1]['Finding Labels'].value_counts()[0:30].plot(kind='bar')
## Label-wise frequency of each disease that co-occurs with Pneumonia
diseases_freq = {}
for i in all_labels:
    if i == "No Finding":
        continue
    # Combine both conditions into one mask so the boolean index stays aligned with all_xray_df
    diseases_freq[i] = len(all_xray_df[(all_xray_df.Pneumonia == 1) & (all_xray_df[i] == 1)])
diseases_freq = {k: v for k, v in sorted(diseases_freq.items(), key=lambda item: item[1], reverse=True)}
plt.figure(figsize=(16,6))
plt.bar(range(len(diseases_freq)), list(diseases_freq.values()), align='center')
plt.xticks(range(len(diseases_freq)), list(diseases_freq.keys()), rotation= 45)
plt.title("Label-wise frequency of each disease that co-occurs with Pneumonia")
plt.show()
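The same co-occurrence counts can also be obtained without a loop by summing the indicator columns over the pneumonia rows. A minimal sketch on a toy indicator table with made-up values:

```python
import pandas as pd

# Toy indicator table in the same 0/1 layout as the label columns above
toy = pd.DataFrame({
    "Pneumonia":    [1, 1, 1, 0, 0],
    "Infiltration": [1, 1, 0, 1, 0],
    "Edema":        [1, 0, 0, 0, 0],
    "Effusion":     [0, 0, 0, 1, 1],
})

# Every label except Pneumonia itself
others = [c for c in toy.columns if c != "Pneumonia"]

# Sum each label over the rows where Pneumonia is present
co_counts = toy.loc[toy["Pneumonia"] == 1, others].sum().sort_values(ascending=False)
print(co_counts)

# Share of pneumonia images that also carry each label
print(co_counts / (toy["Pneumonia"] == 1).sum())
```

Expressing the counts as a share of pneumonia images makes it easier to compare labels that differ a lot in overall frequency.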
#Age distribution for Pneumonia
plt.figure()
plt.hist([all_xray_df[all_xray_df["Pneumonia"]==1]['Patient Age'].values], bins = 10, range=[0, 120])
#Gender distribution for Pneumonia
plt.figure()
all_xray_df[all_xray_df.Pneumonia==1]['Patient Gender'].value_counts().plot(kind='bar')
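The age and gender plots for the pneumonia subset are easier to interpret next to the same figures for the remaining images. The sketch below shows one way to tabulate both on a toy table; the rows are made up and only the column names follow the dataset.

```python
import pandas as pd

# Toy table with made-up rows
toy = pd.DataFrame({
    "Patient Age": [5, 34, 61, 47, 72, 29],
    "Patient Gender": ["M", "F", "M", "M", "F", "F"],
    "Pneumonia": [1, 0, 1, 0, 0, 1],
})

# Age statistics (count, mean, quartiles, ...) for each group
age_summary = toy.groupby("Pneumonia")["Patient Age"].describe()
print(age_summary)

# Gender proportions within each group (each row sums to 1)
gender_share = pd.crosstab(toy["Pneumonia"], toy["Patient Gender"], normalize="index")
print(gender_share)
```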
# Number of findings per image (each row of all_xray_df is one image)
count_list = all_xray_df[all_labels].sum(axis=1)  # row-wise sum of the 0/1 label columns
idx = pd.Index(count_list, name ='disease_count').astype('int64')
idx.value_counts()
plt.figure(figsize=(16,6))
plt.xlabel('Number Of Diseases', fontsize=14)
plt.ylabel('Number Of People', fontsize=14)
idx.sort_values().value_counts().sort_index().plot(kind='bar')
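One thing to keep in mind when reading the bar chart above: all_labels includes "No Finding", so a healthy image contributes 1 to the sum and falls into the same bar as an image with exactly one disease. The sketch below, on a toy indicator table with made-up values, shows the difference when "No Finding" is left out of the sum.

```python
import pandas as pd

# Toy indicator table with made-up values
toy = pd.DataFrame({
    "No Finding":   [1, 0, 0],
    "Pneumonia":    [0, 1, 1],
    "Infiltration": [0, 0, 1],
})

# Summing every label column counts a healthy image as 1
with_no_finding = toy.sum(axis=1)

# Summing only the disease columns counts a healthy image as 0
disease_cols = [c for c in toy.columns if c != "No Finding"]
diseases_only = toy[disease_cols].sum(axis=1)

print(with_no_finding.tolist())
print(diseases_only.tolist())
```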
## Load 'sample_labels.csv' data for pixel level assessments
sample_df = pd.read_csv('sample_labels.csv')
sample_df['Image Path'] = sample_df['Image Index'].map(all_image_paths.get)
sample_df.sample(3)
def show_label_images(label):
    # Show 3 random sample images whose 'Finding Labels' equal `label` and return their paths
    j = 1
    img_list = []
    plt.figure(figsize=(16,16))
    for i in sample_df[sample_df["Finding Labels"] == label].sample(3)["Image Path"]:
        img_list.append(i)
        plt.subplot(1,3,j)
        j=j+1
        plt.imshow(cv2.imread(i))
        plt.title(label + " Image " + str(j-1))
    return img_list
def n_pixel_intensity(img_list ,label):
    # Plot the standardized intensity histogram of each image and return its [mean, std]
    j = 1
    plt.figure(figsize=(16,5))
    mean_std_list = []
    for i in img_list:
        plt.subplot(1,3,j)
        j=j+1
        img = cv2.imread(i)
        # Keep only pixels brighter than 50 to leave out the dark background
        img_mask = (img > 50)
        img = img[img_mask]
        mean_intensity = np.mean(img)
        std_intensity = np.std(img)
        new_img = img.copy()
        # Standardize to zero mean and unit standard deviation
        new_img = (new_img - mean_intensity)/std_intensity
        mean_std_list.append([mean_intensity, std_intensity])
        plt.hist(new_img.ravel(), bins=256)
        plt.title(label + " (Pixel intensities)" + " Image " + str(j-1) )
    return mean_std_list
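To show what the masking and standardization in n_pixel_intensity do without needing an image file, here is a sketch on a synthetic array. The random image is only a stand-in for an X-ray; the threshold of 50 is the one used above.

```python
import numpy as np

# Synthetic 8-bit "image" used as a stand-in for an X-ray
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)

# Keep only pixels brighter than the threshold used above
mask = fake_img > 50
pixels = fake_img[mask].astype(np.float64)

mean_intensity = pixels.mean()
std_intensity = pixels.std()

# Standardize: subtract the mean and divide by the standard deviation
standardized = (pixels - mean_intensity) / std_intensity

print("pixels kept:", pixels.size, "of", fake_img.size)
print("mean after standardization:", standardized.mean())
print("std after standardization:", standardized.std())
```

After standardization every image has mean 0 and standard deviation 1, so the histograms compare the shape of the intensity distribution, while the raw mean and standard deviation returned by the function carry the brightness and contrast information.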
# Some samples of pixel-level intensities for the labels present in the data
mean_std_list_all = []
for i in all_labels:
    img_list = show_label_images(i)
    mean_std_list = n_pixel_intensity(img_list, i)
    mean_std_list_all.append(mean_std_list)
# Mean and standard deviation values for the graphs shown above
for i in range(len(all_labels)):  # iterate over every label, including the last one
    print(all_labels[i])
    print(mean_std_list_all[i][0])
    print(mean_std_list_all[i][1])
    print(mean_std_list_all[i][2])
#Bounding Box Analysis for Later
# bbox = pd.read_csv('/data/BBox_List_2017.csv')
# sample_pn = bbox[bbox["Finding Label"]=="Pneumonia"].sample(1)
# sample_pn
# #full image path for given file name
# import os
# def file_path(img_name):
# file_loc = {}
# for dirs,subdirs, files in os.walk('/data/'):
# for file in files:
# file_loc[file] = os.path.join(dirs, file)
# return file_loc[img_name]
# #function for pixel level intensity
# def n_pixel_intensity(img_data, color='blue'):
# plt.figure(figsize=(5,5))
# plt.title('Normalized Image Pixel Intensity')
# img_mask = (img_data > 50)
# img_data = img_data[img_mask]
# mean_intensity = np.mean(img_data)
# std_intensity = np.std(img_data)
# new_img = img_data.copy()
# new_img = (new_img - mean_intensity)/std_intensity
# plt.hist(new_img.ravel(), bins=256, color=color);
# return mean_intensity, std_intensity
# sample_pn_img = [x for x in sample_pn["Image Index"]][0]
# img_data = cv2.imread(file_path(sample_pn_img))
# plt.imshow(img_data, cmap='gray')
# mean_intensity, std_intensity = n_pixel_intensity(img_data)
# print("mean intensity:", mean_intensity)
# print("std intensity:",std_intensity)
# from PIL import Image
# img = Image.open(file_loc[sample_pn_img])
# temp = bbox[(bbox['Image Index'] == sample_pn_img) & (bbox['Finding Label'] == "Pneumonia")]
# im = img.crop((int(temp['Bbox [x']), int(temp['y']), (int(temp['Bbox [x'])+int(temp['w'])+1), (int(temp['y'])+int(temp['h]']))))
# plt.imshow(im, cmap='gray')
# pix = np.array(im)
# mean_intensity, std_intensity = n_pixel_intensity(pix, color='red')
# print("mean intensity:", mean_intensity)
# print("std intensity:",std_intensity)
# from itertools import chain
# all_labels = np.unique(list(chain(*bbox['Finding Label'].map(lambda x: x.split('|')).tolist())))
# all_labels = [x for x in all_labels if len(x)>0]
# print('All Labels ({}): {}'.format(len(all_labels), all_labels))
# for c_label in all_labels:
# if len(c_label)>1: # leave out empty labels
# bbox[c_label] = bbox['Finding Label'].map(lambda finding: 1.0 if c_label in finding else 0)
# bbox.sample(3)